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O ' Abstract 

•^ I The problem of learning forest-structured discrete graphical models from i.i.d. samples 



is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive 
thresholding is proposed. It is shown that this algorithm is both structurally consistent 
and risk consistent and the error probability of structure learning decays faster than any 
polynomial in the number of samples under fixed model size. For the high-dimensional 
S^ i scenario where the size of the model d and the number of edges k scale with the number of 

5-H ' samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural 

and risk consistencies. In addition, the extremal structures for learning are identified; we 
prove that the independent (resp. tree) model is the hardest (resp. easiest) to learn using 
the proposed algorithm in terms of error rates for structure learning. 

Keywords: Graphical models, Forest distributions. Structural consistency. Risk consis- 
tency. Method of types. 

1. Introduction 

Graphical models (also known as Markov random fields) have a wide range of a pplications 



i n div er se fields such as signal p r ocessi ng, coding theory and bioinformatics. See iLauritzen 



( 19961 ). IWainwright and Jordan! ( 20031 ) and references therein for examples. Inferring the 



structure and parameters of graphical models from samples is a starting point in all these 
applications. The structure of the model provides a quantitative interpretation of relation- 
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ships amongst the given collection of random variables by specifying a set of conditional 
independence relationships. The parameters of the model quantify the strength of these 
interactions among the variables. 

The challenge in learning graphical models is often compounded by the fact that typically 
only a small number of samples are available relative to the size of the model (dimension 
of data). This is referred to as the high-dimensional learning regime, which differs from 
classical statistics where a large number of samples of fixed dimensionality are available. 
As a concrete example, in order to analyze the effect of environmental and genetic factors 
on childhood asthma, clinician scientists in Manchester , UK have been cond u cting a lon- 
gitudinal birth-cohort study since 1997 (jCustovic et al.l , I2OO2I : ISimpson et al.l . l20ld ). The 



number of variables collected is of the order of d ~ 10^ (dominated by the genetic data) 
but the number of children in the study is small (n ~ 10'^). The paucity of subjects in 
the study is due in part to the prohibitive cost of collecting high-quality clinical data from 
willing participants. 

In order to learn high-dimensional graphical models, it is imperative to strike the right 
balance between data fidelity and overfitting. To a .meliorate the effect of overfitt ing, the 
samples are often fitted to a sparse graphical model ( Wainwright and Jordanl . l2003l ) . with 



a 



small number of edges. One popular and tractable class of sparse grap hical models is t he set 
of treqjj models. When r estricted to trees, the Chow-Liu algorithm (jChow and Liul . 119681 : 



Chow and Wagneii . Il973l ) provides an efficient implementation of the maximum-likelihood 



(ML) procedure to learn the structure from independen t samples. How ever, in the high- 
dimensional regime, even a tree may overfit the data ( Liu et al.l . 12010 ). In this paper. 



we consider learning high-dimensional, forest-structured (discrete) graphical models from a 
given set of samples. 

For learning the forest structure, the ML (Chow-Liu) algorithm does not produce a 
consistent estimate since ML favors richer model classes and hence, outputs a tree in general. 
We propose a consistent algorithm called CLThres, which has a thresholding mechanism to 
prune "weak" edges from the Chow-Liu tree. We provide tight bounds on the overestimation 
and underestimation errors, that is, the error probability that the output of the algorithm 
has more or fewer edges than the true model. 

1.1 Main Contributions 

This paper contains three main contributions. Firstly, we propose an algorithm named 
CLThres and prove that it is structurally consistent when the true distribution is forest- 
structured. Secondly, we prove that CLThres is risk consistent, meaning that the risk under 
the estimated model converges to the risk of the forest projectior^ of the underlying distri- 
bution, which may not be a forest. We also provide precise convergence rates for structural 
and risk consistencies. Thirdly, we provide conditions for the consistency of CLThres in the 
high-dimensional setting. 

We first prove that CLThres is structurally consistent, i.e., as the number of samples 
grows for a fixed model size, the probability of learning the incorrect structure (set of edges). 



1. A tree is a connected, acyclic graph. We use the term proper forest to denote the set of disconnected, 
acyclic graphs. 

2. The forest projection is the forest-structured graphical model that is closest in the KL-divergence sense 
to the true distribution. We define this distribution formally in (|13p . 
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decays to zero for a fixed model size. We sliow that the error rate is in fact, dominated by 
the rate of decay of the overestimation err or probabilityQ We use an information-theoretic 
technique known as the method of types ( Cover and Thomaa . l2006l . Ch. 11) as well as a 
recen tly-developed technique known as Euclidean information theory (JBorade and Zheng . 
2008 ) . We provide an upper bound on the error probability by using convex duality to 
find a surprising connection between the overestimation error rate and a semidefinite pro- 
gram ( Vandenberghe and Boydl . Il996l ) and show that the overestimation error in structure 
learning decays faster than any polynomial in n for a fixed data dimension d. 

We then consider the high-dimensional scenario and provide sufficient conditions on the 
growth of (n, d) (and also the true number of edges k) to ensure that CLThres is structurally 
consistent. We prove that even if d grows faster than any polynomial in n (and in fact 
close to exponential in n), structure estimation remains consistent. As a corollary from 
our analyses, we also show that for CLThres, independent models (resp. tree models) are 
the "hardest" (resp. "easiest") to learn in the sense that the asymptotic error rate is the 
highest (resp. lowest), over all models with the same scaling of (n, d). Thus, the empty 
graph and connected trees are the extremal forest structures for learning. We also prove 
that CLThres is risk consistent, i.e., the risk of the estimated forest distribution converges 
to the risk of the forest projection of the true model at a rate of Op(dlo Rd/n^~'') ioi aii y 
7 > 0. We compare and contrast this rate to existing results such as those iLiu et al.l ( 20ld ). 
Note that for this result, the true probability model does not need to be a forest-structured 
distribution. Finally, we use CLThres to learn forest-structured distributions given synthetic 
and real- world datasets and show that in the finite-sample case, there exists an inevitable 
trade-off between the underestimation and overestimation errors. 



1.2 Related Work 

There are many papers that discu s s the learning of g r aphical models fr o m da t a. See 

Dudik et al.l(l2004l).lLee et al.l(|2006l ). lAbbeel et alll2006l ) . Iwainwright et~all (|2006l ) . [Meinshausen and Buehlm 
( 20061 ) ■ I Johnson et al.l ( 20071 ). and references therein. Most of these methods pose the learn- 
ing problem as a parameterized convex optimization problem, typically with a regulariza- 
tion term to enforce sparsity in the learned graph. Consistency guarantees in terms of n 
and d (and possibly the maximum degree) are pr ovided. Information-theoretic limits for 
learning g r aphic al models have also been derived in lSanthanam and Wainwrightl ( 20081 ). In 
Zuk et al.l ( 20061 ). bounds on the error rate for learning the struct ure of Bayesian ne tworks 
using the Bayesian Information Criterion (BIC) were provided. iBach and Jordan! (|2003. ) 
learned tree-structured models for solving the independent compo nent analysis (ICA) prob- 
l em. A PAC analysis for learning thin junction trees was given in lChechetka and Guestrin 



( 20071 ). iMeila and JordanI (120001 ) discussed the learning of graphical models from a differ- 



ent perspective; namely that of learning mixtures of trees via an expectation-maximization 
procedure. 



By using the theory of large-deviations (JDembo and Zeitounil. Il998l). w e derived and 
analyzed the error exponent for learning trees for discrete (JTan et al.l . l201ll ) and Gaussian 
( Tan et al.l . l2010al ) graphical models. The error exponent is a quantitative measure of 



3. The overestimation error probability is the probabihty that the number of edges learned exceeds the true 
number of edges. The underestimation error is defined analogously. 
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performance of the learning algorithm since a larger exponent implies a faster decay of the 
error probability. However, the analysis does not readily extend to learning forest models. 
In addition, we posed the structur e learning problem for trees as a composite hypothesis 
testing problem ( Tan et al.l . l2010bl ) and derived a closed- form expression for the Chernoff- 
Stein exponent in term s of the mu t ual in formation on the bottleneck edge. In a paper that is 
closely related to ours, iLiu et al.l (J2010|) derived consistency (and sparsistency) guarantees 
for learning tree and forest models. The pairwise joint distributions are modeled using 
kernel density estimates, where the kernels are Holder continuous. This differs from our 
approach since we assume that each variable can only take finitely many values, leading to 
stronger results on error rates for structure learning via the method of types, a powerful 
proof technique in information theory and statistics. We compare our convergence rates 
to these related works in Section [6j Furthermore, the algorithm suggested in both papers 
uses a subset (usually half) of the dataset to learn the full tree model and then uses the 
remaining subset to prune the model based on the log- likelihood on the held-out set. We 
suggest a more direct and consistent method based on thresholding, which uses the entire 
dataset to learn and prune the model without recourse to validation on a he l d-out dataset. 
It is well known that validation is both computationally expensive (JBishopl . |2008| . pp. 33) 
and a pote ntial wast e of va luable data which may otherwise be employed to learn a better 
model. In iLiu et al.l ( 20101 ). the problem of estimating forests with restricted component 
sizes was considered and was proven to be NP-hard. We do not restrict the component size 
in this paper but instead attempt to learn the model with the minimum number of edges 
which best fits the data. 



Our work is also related to and inspired by the vast body of literature in information 
theory and statistics on Markov order estimation. In these works, the authors use various 
regularization and model selection , schemes to find the optimal or der of a Markov chain 
( Merhav et al.l.ll989l:lFinesso et al.l . Il996l : ICsiszar and Shieldsl. 200d) . hidden Markov model 
( Gassiat and Boucheronl . l2003l ) or exponential family ( Merhavl . ll989l ). We build on some of 
these ideas and proof techniques to identify the correct set of edges (and in particular the 
number of edges) in the forest model and also to provide strong theoretical guarantees of 
the rate of convergence of the estimated forest-structured distribution to the true one. 



1.3 Organization of Paper 



This paper is organized as follows: We define the mathematical notation and formally state 
the problem in Section [2j In Section [3l we describe the algorithm in full detail, highlighting 
its most salient aspect - the thresholding step. We state our main results on error rates for 
structure learning in Section [5] for a fixed forest-structured distribution. We extend these 
results to the high-dimensional case when (n, d, k) scale in Section [5l Extensions to rates of 
convergence of the estimated distribution, i.e., the order of risk consistency, are discussed 
briefiy in Section [6l Numerical simulations on synthetic and real data are presented in 
Section [71 Finally, we conclude the discussion in Section [8l The proofs of the majority of 
the results are provided in the appendices. 
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2. Preliminaries and Problem Formulation 

Let G = iy, E) be an undirected graph with vertex (or node) set F := {1, . . . , d} and edge 
set E C (2) and let nbd(i) := {j ^ V : {i,j) G E} be the set of neighbors of vertex i. 
Let the set of labeled trees (connected, acyclic graphs) with d nodes be T*^ and let the set 
of forests (acyclic graphs) with k edges and d nodes be T^'^ for < A; < d — 1. The set of 
forests includes all the trees. We reserve the term proper forests for the set of disconnected 
acylic graphs ^J^^qT^. We also use the notation T'^ := Uj^'ZqT^ to denote the set of labeled 
forests with d nodes. 

A graphical model i Lauritzenl . 119961 ) is a family of multivariate probability distribu- 



tions (probability mass functions) in which each distribution factorizes according to a 
given undirected graph and where each variable is associated to a node in the graph. Let 
X = {1, ■ ■ ■ ,r} (where 2 < r < cx)) be a finite set and X'^ the d-fold Cartesian product of 
the set X. As usual, let V{X'^) denote the probability simplex over the alphabet X'^. We 
say that the random vector X = {Xi, . . . ,Xd) with distribution Q G ■p(A"^) is Markov on 
the graph G = {V, E) if 

Q{xi\x^\,A(i)) =Q{xi\xv\i), ViGl/, (1) 

where xy\i is the collection of va riables excluding variable i. Eq. ([T]) is known as the local 
Markov property ( Lauritzenl . 1 19961 ) . In this paper, we always assume that graphs are minimal 



representations for the corresponding graphical model, i.e., if Q is Markov on G, then G 
has the smallest number of edges for the conditional independence relations in ([TJ to hold. 
We say the distribution Q is a forest- structured distribution if it is Markov on a forest. We 
also use the notation D{T^) C V{X ) to denote the set of d-variate distributions Markov 
on a forest with k edges. Similarly, 'D{J^'^) is the set of forest-structured distributions. 

Let P € T^{T^) be a discrete forest-structured distribution Markov on Tp = {V, Ep) € 
T^ (for some k = 0, . . . ,d — 1). It is known that the joint distribution P factorizes as 



follows (JLauritzenl . Il99' 



p(x)=n^^(-o n p'£^.y (2) 

iev ii,j)eEp'^'^^''^^^''^^ 

where {Pjjigy and {Pi,j}{i,j)eEp ai'6 the node and pairwise marginals which are assumed 
to be positive everywhere. 

The mutual information (MI) of two random variables Xi and Xj with joint distribution 
Pjj is the function /(•) : V{X'^) — )■ [0,oo) defined as 






npi.i)-= 2. j'.jfe.^i)i°g o;'i\;',i\ . (3) 



This n otation for mutual information differs from the usual I{Xi; Xj) used in lCover and Thomas 



S); we emphasize the dependence of / on the joint distribution P,, The m^n^mum mu 



tual information in the forest, denoted as /mm := ^''^{i,j)eEp H^ij) ^^^^ turn out to be a 
fundamental quantity in the subsequent analysis. Note from our minimality assumption 
that /mill > since all edges in the forest have positive mutual information (none of the 
edges are degenerate). When we consider the scenario where d grows with n in Section [5l 
we assume that /min is uniformly bounded away from zero. 
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2.1 Problem Statement 

We now state the basic problem formally. We are given a set of i.i.d. samples, denoted as 
x" := {xi,...,x„}. Each sample x; = {xi^i, . . . ,xi^(i) € X'^ is drawn independently from 
P G T^O'k) ^ forest-structured distribution. From these samples, and the prior knowledge 
that the undirected graph is acyclic (but not necessarily connected), estimate the true set 
of edges Ep as well as the true distribution P consistently. 

3. The Forest Learning Algorithm: CLThres 

We now describe our algorithm for estimating the edge set Ep and the distribution P. This 
algorithm is a modification of the celebrated Chow-Liu algor i thm for maximum-likelihood 
(ML) learning of tree-structured distributions ( Chow and Liul . ll968l ). We call our algorithm 



CLThres which stands for Chow-Liu with Thresholding. 

The inputs to the algorithm are the set of samples x" and a regularization sequence 
{ffnlneN (to be specified precisely later) that typically decays to zero, i.e., \\in.n^oo^n = 
0. The outputs are the estimated edge set, denoted Et , and the estimated distribution, 
denoted P* . 

1. Given x", calculate the set of pairwise empirical distribution^ (or pairwise types) 
{Pjjlijgy. This is just a normalized version of the counts of each observed symbol 
in A!^ and serves as a set of sufficient statistics for the estimation problem. The 
dependence of Pij on the samples x" is suppressed. 

2. Form the set of empirical mutual information quantities: 

-Mj yXi, Xj ) 



I{Pij):= ^ Pij{xi,Xj)\oi 



-'i V'^* J -'j v^j ) 



for 1 < i, J < d. This is a consistent estimator of the true mutual information in ([3]). 
3. Run a max-weight spanning tree (MWST) algorithm ( Priml . 119571 : iKruskall . Il956l ) to 



obtain an estimate of the edge set: 

Ed-i := argmax V] /(Pjj). 

Let the estimated edge set be £"^-1 '■= {ei, . . . ,6^-1} where the edges Cj are sorted 
according to decreasing empirical mutual information values. We index the edge set 
by d — 1 to emphasize that it has d — 1 edges and hence is connected. We denote 
the sorted empirical mutual information quantities a s IjPp^) > . . . > I(Pf ^ ^)- These 



first three steps constitute the Chow-Liu algorithm (jChow and Liul . 11968 
4. Estimate the true number of edges using the thresholding estimator. 

kn := argmin |/(Pe, ) : I{Pe, )>£«,, I{Pe,+, )<£«)• (4) 



4. In this paper, the terms empirical distnbution and type are used interchangeably. 
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If there exists an empirical mutual information I{Pe) such that I{Pe) = En; break 
the tie arbitrarily|f| 

5. Prune the tree by retaining only the top kn edges, i.e., define the estimated edge set 
of the forest to be 

Ez ■= {ei,...,% }, 

where {cj : 1 < i < d — 1} is the ordered edge set defined in Step [3l Define the 
estimated forest to be Tr := (V, Er ). 

6. Final ly, define the estimated distribution P* to be the reverse I-projection ( Csiszar and Matua . 
2003 ) of the joint type P onto Tr , i.e., 



P*(x) := argniin D{P\\Q). 

It can easily be shown that the projection can be expressed in terms of the marginal 
and pairwise joint types: 



p*(x)=n^.(^.) n 



iev 



i^,j)eE. 



^i,j\Xi^ Xj) 



Intuitively, CLThres first constructs a connected tree {V,E^_i) via Chow-Liu (in Steps [T] 
- E]) before pruning the weak edges (with small mutual information) to obtain the final 
structure E-r . The estimated distribution P* is simply the ML estimate of the parameters 

subject to the constraint that P* is Markov on the learned tree Tt . 

Note that if Step [His omitted and kn is defined to be d — 1, then CLThres simply reduces 
to the Chow-Liu ML algorithm. Of course Chow-Liu, which outputs a tree, is guaranteed 
to fail (not be structurally consistent) if the number of edges in the true model k < d — 1, 
which is the problem of interest in this paper. Thus, Step [H a model selection step, is 
essential in estimating the true number of edges k. This step is a gener alization of the 
test for independence of discrete memoryless sources discussed in iMerhavl (|l989 ). In our 
work, we exploit the fact that the empirical mutual information I{Pe) corresponding to 
a pair of independent variables ej will be very small when n is large, thus a thresholding 
procedure using the (appropriately chosen) regularization sequence {£n} will remove these 
edges. In fact, the subsequent analysis allows us to conclude that Step HI in a formal sense, 
dominates the error probability in structure learning. CLThres is also efficient as shown by 
the following result. 

Proposition 1 (Complexity of CLThres) CLThres runs in time 0{{n + log d)d'^). 

Proof The computation of the sufficient statistics in Steps \T\ and [2] requires Ojnd^) opera- 
tions. The MWST algorithm in Step [3] requires at most 0(d^ logd) operations ( Priml . 119571 ). 



5. Here were allow a bit of imprecision by noting that the non-strict inequalities in Q simplify the sub- 
sequent analyses because the constraint sets that appear in optimization problems will be closed, hence 
compact, insuring the existence of optimizers. 
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Steps m and [5] simply require the sorting of the empirical mutual information quantities on 
the learned tree which only requires 0{logd) computations. ■ 



4. Structural Consistency For Fixed Model Size 

In this section, we keep d and k fixed and consider a probability model P, which is assumed 
to be Markov on a forest in Tjf. This is to gain better insight into the problem before we 
analyze the high-dimensional scenario in Section [5] where d and k scal^j with the sample 
size n. More precisely, we are interested in quantifying the rate at which the probability of 
the error event of structure learning 

An := {x" G {X^r ■■ \ (x") ^ Ep] (5) 

decays to zero as n tends to infinity. Recall that Et , with cardinality kn , is the learned edge 
set by using CLThres. As usual, P" is the n-fold product probability measure corresponding 
to the forest-structured distribution P. 

Before stating the main result of this section in Theorem [3l we first state an auxiliary re- 
sult that essentially says that if one is provided with oracle knowledge of Imiii) the minimum 
mutual information in the forest, then the problem is greatly simplified. 

Proposition 2 (Error Rate with knowledge of /min) Assume that Imin is known in 
CLThres. Then by letting the regularization sequence be e„ = -^min/2 for all n, we have 

hm -logP"(^„)<0, (6) 

n— >oo 77, 

i.e., the error probability decays exponentially fast. 

The proof of this theorem and all other results in the sequel can be found in the appendices. 
Thus, the primary difficulty lies in estimating /min or equivalently, the number of edges 
k. Note that if k is known, a simple modification to the Chow-Liu procedure by imposing 
the constraint that the final structure contains k edges will also yield exponential decay as 
in ([6]). However, in the realistic case where both /min and k are unknown, we show in the 
rest of this section that we can design the regularization sequence e^ in such a way that the 
rate of decay of P'^{An) decays almost exponentially fast. 

4.1 Error Rate for Forest Structure Learning 

We now state one of the main results in this paper. We emphasize that the following result 
is stated for a fixed forest-structured distribution P G T)(T^) so d and k are also fixed 
natural numbers. 

Theorem 3 (Error Rate for Structure Learning) Assume that the regularization se- 
quence {CnlnGN satisfies the following two conditions: 

lim en = 0, lim - — — = cx). (7) 

n— i>oo ra— i>cxD log n 



6. In that case P must also scale, i.e., we learn a family of models as d and k scale. 
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Figure 1: Graphical interpretation of the condition on e„. As n — ?> oo, the regularization 
sequence e„ will be smaller than Imin and larger than I{Q^-) with high probability. 



Then, if the true model Tp = (V, Ep) is a proper forest (k < d — 1), there exists a constant 
Cp € (1,00) such that 



-Cp < lim inf log P"(X) 

n->oo nen 

< lim sup logP"(A) < -1- 

n— >oo ri£n 

Finally, if the true model Tp = {V, Ep) is a tree (k = d — 1), then 

lim -logP"(A)<0, 

n— J-oo n 

i.e., the error probability decays exponentially fast. 



(8) 
(9) 

(10) 



4.2 Interpretation of Result 

From Q , the rate of decay of the error probability for proper forests is subexponential but 
nonetheless can be made faster than any polynomial for an appropriate choice of £«• The 
reason for the subexponential rate is because of our lack of knowledge of Imin, the minimum 
mutual information in the true forest Tp. For trees, the ratqil is exponential (= exp(— nF) 
for some positive constant F). Learning proper forests is thus, strictly "harder" than 
learning trees. The condition on e„ in ([7]) is needed for the following intuitive reasons: 

1. Firstly, ([7]) ensures that for all sufficiently large n, we have e„ < -^min- Thus, the 
true edges will be correctly identified by CLThres implying that with high probability, 
there will not be underestimation as n — )• 00. 



7. We use the asymptotic notation from information theory = to denote equality to first order in the 
exponent. More precisely, for two positive sequences {a„}„gN and {6n}n6N we say that a„ = 6„ iff 
lim„^oo n~^ log(a„/fe„) = 0. 
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2. Secondly, for two independent random variables Xi and Xj with distribution Qi^j = 
QiQj, the sequence! cr(/(Q")) = 6(l/n), where Q" • is the joint empirical distribution 
of n i.i.d. samples drawn from Qij. Since the regularization sequence e„ = a;(logn/n) 
has a slower rate of decay than a{I{Q'^A), Sn > HQ^j) with high probability as 
n — )> oo. Thus, with high probability there will not be overestimation as n ^ cx). 

See Figure [1] for an illustration of this intuition. The formal proof follows from a method 
of types argument and we provide an outline in Section 14.31 A convenient choice of e^ that 
satisfies ([7]) is 

en:=n-^, V/3g(0,1). (11) 

Note further that the upper bound in ([9]) is also independent of P since it is equal to —1 
for all P. Thus, ([9]) is a unz?;ersa/ result for all forest distributions P G D^T ). The intuition 
for this universality is because in the large-n regime, the typical way an error occurs is due 
to overestimation. The overestimation error results from testing whether pairs of random 
variables are independent and our asymptotic bound for the error probability of this test 
does not depend on the true distribution P. 

The lower bound Cp in ([8]), defined in the proof in Appendix |Bl means that we cannot 
hope to do much better using CLThres if the original structure (edge set) is a proper forest. 
Together, ^ and Q imply that the rate of decay of the error probability for structure 
learning is tight to within a constant factor in the exponent. We believe that the error rates 
given in Theorem [3] cannot, in general, be improved without knowledge of /mm- We state 
a converse (a necessary lower bound on sample complexity) in Theorem [7] by treating the 
unknown forest graph as a uniform random variable over all possible forests of fixed size. 

4.3 Proof Idea 

The rnethod of proof for Theorem [3] involves using the Gallager-Fano bounding technique 
( Fand . ll96ll . pp. 24) and the union bound to decompose the overall error probability P^{An) 



into three distinct terms: (i) the rate of decay of the error probability for learning the top 
k edges (in terms of the mutual information quantities) correctly - known as the Chow-Liu 
error, (ii) the rate of decay of the overestimation error {kn > k} and (iii) the rate of decay of 
the unde restimation error {kn < k} . Each of these terms is upper bounded using a method 



of types (jCover and Thomasl . l2006l . Ch. 11) arg ument. It turns out , as is the case with 



the literature on Markov order estimation (e.g., iFinesso et al.l (119961 )) . that bounding the 



overestimation error poses the greatest challenge. Indeed, we show that the underestimation 
and Chow-Liu errors have exponential decay in n. However, the overestimation error has 
subexponential decay {■^ exp(— ne„)). 

The main tec hnique used to analyze th e overestimation error relies on Euclidean in- 
formation theory ( Borade and Zhend . l2008l ) which states that if two distributions uq and 



z^i (both supported on a common finite alphabet 3^) are close entry-wise, then various 



8. The notation ij{Z) denotes the standard deviation of the random variable Z. The fact that the standard 
deviation of the empirical MI (j{I{Q^,j)) decays as 1/n can be verified by Taylor expandin g I{QYj) 
aroun d Qij = QiQj and using the fact that the ML estimate converges at a rate of n~^'^ l|Serflind . 
Il980l ). 

10 
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information-theoretic measures can be approximated locally by quantities related to Eu- 
clidean norms. For example, the KL-divergence D{uq\\i'i) can be approximated by the 
square of a weighted Euclidean norm: 

»(^.iki)=^E "^°'°i",:;''"'' +°(iN--'iii^)- (12) 



aey 



Note that if uq ~ ui, then -D(i^o II ^i) is close to the sum in (|12p and the o(\\ i'n — z^i 



terrn can be neglected. Using this approx imation and Lagrangian duality (jBertseka! 



|2 



i 



I999I ). we reduce a non-convex I-projection (jCsiszar and Matua . l2003l ) problem involving 



infor mation-theoretic quantities (such as divergence) to a relatively simple semidefinite pro- 



( Vandenberghe and BovdI . Il996l ) which admits a closed- form solution. Furthermore, 



gram 

the approximation in (|12|) becomes exact as n —^ 00 (i.e., £n ^0), which is the asymptotic 

regime of interest. The full details of the proof can be found Appendix [Bl 

4.4 Error Rate for Learning the Forest Projection 

In our discussion thus far, P has been assumed to be Markov on a forest. In this subsec- 
tion, we consider the situation when the underlying unknown distribution P is not forest- 
structured but we wish to learn its best forest approximation. To this end, we define the 
projection of P onto the set of forests (or forest projection) to be 

P := argmin D{P\\Q). (13) 

If there are multiple optimizing distribution, choose a projection P that is minimal, i.e., its 
graph Tp = {V, Ep) has the fewest number of edges such that (fT3l) holds. If we redefine the 
event An in ([5]) to be An '■= {Et 7^ Ep}^ we have the following analogue of Theorem [3l 

Corollary 4 (Error Rate for Learning Forest Projection) Let P he an arbitrary dis- 
tribution and the event An be defined as above. Then the conclusions in ^ - (jlOp in 
Theorem\^ hold if the regularization sequence {£n}n&] satisfies d?]). 

5. High-Dimensional Structural Consistency 

In the previous section, we considered learning a fixed forest-structured distribution P (and 
hence fixed d and k) and derived bounds on the error rate for structure learning. However, 
for most problems of practical interest, the number of data samples is small compared to the 
data dimension d (see the asthma example in the introduction). In this section, we prove 
sufficient conditions on the scaling of {n,d,k) for structure learning to remain consistent. 
We will see that even if d and k are much larger than n, under some reasonable regularity 
conditions, structure learning remains consistent. 

5.1 Structure Scaling La^v 

To pose the learning problem formally, we consider a sequence of structure learning problems 
indexed by the number of data points n. For the particular problem indexed by n, we have a 
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dataset x" = (xi, . . . , x„) of size n where each sample x/ € X is drawn independently from 
an unknown d-variate forest-structured distribution P^'^' € 'D(Tj^), which has d nodes and 
k edges and where d and k depend on n. This high-dimensional setup allows us to model 
and subsequently analyze how d and k can scale with n while maintaining consistency. We 
will sometimes make the dependence of d and k on n explicit, i.e., d = dn and k = kn- 
In order to be able to learn the structure of the models we assume that 

(Al) Ii„f:=inf min /(Pg^) > 0, (14) 

den (tj)eE^(d) '■' 

(A2) K;:=inf min p/f (xi,x,) > 0. (15) 

That is, assumptions (Al) and (A2) insure that there exists uniform lower bounds on the 
minimum mutual information and the minimum entry in the pairwise probabilities in the 
forest models as the size of the graph grows. Thes e are ty pical regularity assumptions for the 



high-dimensional setting. See IWainwright et al.l (120061 ) and iMeinshausen and Buehlmann 






(|2006i ) for example. We again emphasize that the proposed learning algorithm CLThres has 
knowledge of neither I\ai nor k. Equipped with (Al) and (A2) and assuming the asymptotic 
behavior of e„ in ([7]), we claim the following theorem for CLThres. 

Theorem 5 (Structure Scaling Law) There exists two finite, positive constants Ci,C2 
such that if 

n>max|(21og(d-A;))^+^, Cilogd, C2logA:|, (16) 

for any C > 0, then the error probability of incorrectly learning the sequence of edge sets 
{Ep(d)}def>i tends to zero as {n,d,k) — )■ oo. When the sequence of forests are trees, n > 
Clogd (where C := max{Ci,C2}J suffices for high- dimensional structure recovery. 

Thus, if the model parameters {n,d,k) all grow with n but d = o(exp(n/Ci)), k = 
o(exp(n/C2)) and d — k = o{exjp{n^~^ /2)) (for all /3 > 0), consistent structure recovery is 
possible in high dimensions. In oth er words, the num ber of nodes d can grow faster than any 
polynomial in the sample size n. In lLiu et al.l ( 2010l ). the bivariate densities are modeled by 



functions from a Holder class with exponent a and it was mentioned (in Theorem 4.3) that 
the number of variables can grow like o(exp(n'^' '^"'""^)) for structural consistency. Our result 
is somewhat stronger but we model the pairwise joint distributions as (simpler) probability 
mass functions (the alphabet Af is a finite set). 

5.2 Extremal Forest Structures 

In this subsection, we study the extremal structures for learning, that is, the structures 
that, roughly speaking, lead to the largest and smallest error probabilities for structure 
learning. Define the sequence 

hn{P):=— log P^ (An), VneN. (17) 

Note that hn is a function of both the number of variables d = dn and the number of edges 
k = kn in the models P^"^' since it is a sequence indexed by n. In the next result, we 
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assume (n, d, k) satisfies the scaling law in ()16p and answer the following question: How 
does hn in (J17p depend on the number of edges A;„ for a given (i„? Let -P| and P2 be 
two sequences of forest-structured distributions with a common number of nodes d„ and 
number of edges A;„(P{ ) and kn{P2 ) respectively. 

Corollary 6 (Extremal Forests) Assume that CLThres is employed as the forest learning 

algorithm. As n ^ 00, hn{P[ ) < /in (^2 ) whenever kn{Pi ) > kn{P2 ) implying that hn 

is maximized when P^'^'^ are product distributions (i.e., kn = Oj and minimized when P^'^' 

3(0 
2 



are tree- structured distributions (i.e., kn = dn — 1). Furthermore, if kn{Pi ) = kn{P2 ] 
thenhn{P[''^) = hn{P^''^). 

Note that the corollary is intimately tied to the proposed algorithm CLThres. We are 
not claiming that such a result holds for all other forest learning algorithms. The intuition 
for this result is the following: We recall from the discussion after Theorem [3] that the 
overestimation error dominates the probability of error for structure learning. Thus, the 
performance of CLThres degrades with the number of missing edges. If there are very 
few edges (i.e., kn is very small relative to dn), the CLThres estimator is more likely to 
overestimate the number of edges as compared to if there are many edges (i.e., kn/dn is 
close to 1). We conclude that a distribution which is Markov on an empty graph (all variables 
are independent) is the hardest to learn (in the sense of Corollary [6] above). Conversely, 
trees are the easiest structures to learn. 

5.3 Lower Bounds on Sample Complexity 

Thus far, our results are for a specific algorithm CLThres for learning the structure of 
Markov forest distributions. At this juncture, it is natural to ask whether the scaling laws 
in Theorem[5]are the best possible over all algorithms (estimators). To answer this question, 
we limit ourselves to the scenario where the true graph Tp is a uniformly distributed chance 
variablqfl with probability measure P. Assume two different scenarios: 

~d ■ _ TaCT j-\ 1 l\T'd\ 



(a) Tp is drawn from the uniform distribution on 7^ , i.e., P(rp = t) = 1/|7^ | for all 
T^. Recall that T^ 



forests t G Tu- Recall that Tu is the set of labeled forests with d nodes and k edges. 



(b) Tp is drawn from the uniform distribution on J^ , i.e., ¥iTp = t) = 1/\T \ for all 
forests t E F"^. Recall that J-'^ is the set of labeled forests with d nodes. 



This following result is inspired by Theorem 1 in iBresler et al.l ( 20081 ). Note that an esti 



mator or algorithm T'^ is simply a map from the set of samples (Af'*)" to a set of graphs 
(either 7^ or J^ ). We emphasize that the following result is stated with the assumption 
that we are averaging over the random choice of the true graph Tp. 

Theorem 7 (Louver Bounds on Sample Complexity) Let g < 1 andr := \X\. In case 
(a) above, if 

{k-l)logd 

n<Q , (18 

dlogr 



9. The term chance variable, attributed to iGallagej l|200ll 'l. describes random quantities Y : Q, -^ W that 
take on values in arbitrary alphabets W . In contrast, a random variable X maps the sample space Q, to 
the reals R. 
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then P(T ^ Tp) — )• 1 for any estimator T : {X )^ — )• Tjf- Alternatively, in case (b), if 



loff d , , 

n<Q-^, 19 

iogr 

then P(f '^ / Tp) ^ 1 for any estimator f'^ : (^Y'^)" -^ T'^ . 

This result, a strong converse, states that n = r2(^logd) is necessary for any estimator 
with oracle knowledge of k to succeed. Thus, we need at least logarithmically many samples 
in d if the fraction k/d is kept constant as the graph size grows even if k is known precisely 
and does not have to be estimated. Interestingly, (fT8|) says that if k is large, then we need 
more samples. This is because there are fewer forests with a small number of edges as 
compared to forests with a large number of edges. In contrast, the performance of CLThres 
degrades when k is small because it is more sensitive to the overestimation error. Moreover, 
if the estimator does not know k, then (|19|) says that n = i^{logd) is necessary for successful 
recovery. We conclude that the set of scaling requirements prescribed in Theorem[5]is almost 
optimal. In fact, if the true structure Tp is a tree, then Theorem [7] for CLThres says that 
the (achievability) scaling laws in Theorem [5] are indeed optimal (up to constant factors in 
the O and r^-notation) since n > (21og((i — /c))^"*"^ in ()16p is trivially satisfied. Note that if 
Tp is a tree, then the Chow-Liu ML procedure or CLThres results in the sample complexity 
n = 0(logd) (see Theorem [5]). 

6. Risk Consistency 

In this section, we develop results for risk consistency to study how fast the parameters of 
the estimated distribution converge to their true values. For this purpose, we define the 
risk of the estimated distribution P* (with respect to the true probability model P) as 

nniP*):=D{P\\P*)-D{P\\P), (20) 

where P is the forest projection of P defined in ()13p . Note that the original probability 
model P does not need to be a forest-structured distribution in the definition of the risk. 
Indeed, if P is Markov on a forest, (I20p reduces to TZn{P*) = D{P \\ P*) since the second 
term is zero. We quantify the rate of decay of the risk when the number of samples n tends 
to infinity. For 5 > 0, we define the event 

Cn,s ■■= {x" € {X'r ■■ ^^ > ^] ■ (21) 

That is, Cn,5 is the event that the average risk TZn{P*)/d exceeds some constant 6. We say 
that the estimator P* (or an algorithm) is 6-risk consistent if the probability olCn,5 tends to 
zero as n — 7- c«. Intuitively, achieving (5-risk consistency is easier than achieving structural 
consistency since the learned model P* can be close to the true forest-projection P in the 
KL-divergence sense even if their structures differ. 

In order to quantify the rate of decay of the risk in (|2Up , we need to define some stochas- 
tic order notation. We say that a sequence of random variables Yn = Op{gn) (for some 
deterministic positive sequence {gn}) if for every e > 0, there exists a B = B^ > such that 
limsup„_^oo Pr(|y„| > Bgn) < e. Thus, Pr(|y„| > Bgn) > e holds for only finitely many n. 
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We say that a reconstruction algorithm has risk consistency of order (or rate) gn if 
TZn{P*) = Op{gn)- The definition of the order of risk consistency involves the true model 
P. Intuitively, we expect that as n ^ oo, the estimated distribution P* converges to the 
projection P so TZn{P*) — ?> in probability. 

6.1 Error Exponent for Risk Consistency 

In this subsection, we consider a fixed distribution P and state consistency results in terms 
of the event Cn^s- Consequently, the model size d and the number of edges k are fixed. This 
lends insight into deriving results for the order of the risk consistency and provides intuition 
for the high-dimensional scenario in Section | 



Theorem 8 (Error Exponent for 5- Risk Consistency) For CLThres, there exists a con- 
stant (5o > such that for all < 6 < 5o, 

limsupilogP"(C„,5)<-5. (22) 

n— >oo ri 

The corresponding lower bound is 

lim inf - log P"(C„ s) > -S d. (23) 

n— >oo n ' 

The theorem states that if 6 is sufficiently small, the decay rate of the probability of Cn.s 
is exponential, hence clearly CLThres is (^-risk consistent. Furthermore, the bounds on the 
error exponent associated to the event Cn,5 are independent of the parameters of P and only 
depend on 6 and the dimensionality d. Intuitively, (f22]l is true because if we want the risk 
of P* to be at most 6d, then each of the empirical pairwise marginals Pij should be (5-close 
to the true pairwise marginal Pij. Note also that for C^^ to occur with high probability, 
the edge set does not need to be estimated correctly so there is no dependence on k. 

6.2 The High-Dimensional Setting 

We again consider the high-dimensional setting where the tuple of parameters {n,dn,kn) 
tend to infinity and we have a sequence of learning problems indexed by the number of data 
points re. We again assume that (fill) and (fT5]l hold and derive sufficient conditions under 
which the probability of the event Cn,s tends to zero for a sequence of d-variate distributions 
^p{<ij ^ 'P{X'^)}deN- The proof of Theorem [8] leads immediately to the following corollary. 

Corollary 9 (5-Risk Consistency Scaling Law) Let 5 > be a sufficiently small con- 
stant and a G (0,5). If the number of variables in the sequence of models {P^'^'}deN satisfies 
dn = o{ex.'p{an)) , then CLThres is 5-risk consistent for {P^ '}(ieN- 

Interestingly, this sufficient condition on how number of variables d should scale with 
re for consistency is very similar to Theorem [5l In particular, if d is polynomial in re, then 
CLThres is both structurally consistent as well as (^-risk consistent. We now study the order 
of the risk consistency of CLThres as the model size d grows. 

15 



Tan, Anandkumar and Willsky 



Theorem 10 (Order of Risk Consistency) The risk of the sequence of estimated dis- 
tributions {{P^'^>)*}deN with respect to {P^'^'}deN satisfies 

K„((F""r)=oJ^V (24) 



ni-T 



1-7 



for every 7 > 0, i.e., the risk consistency for CLThres is of order (dlogd)/n 

Note that since this result is stated for the high-dimensional case, d = dn is a sequence 
in n but the dependence on n is suppressed for notational simplicity in (I24p . This result 
implies that if d = o{n^~^^) then CLThres is risk consistent, i.e., TZn{{P'^'^^)*) — > in 
probability. Note that this result is not the same as the conclusion of Corollary [9] which 
refers to the probability that the average risk is greater than a fixed constant 5. Also, the 
order of convergence given in (j24p does not depend on the true number of edges k. This is 
a consequence of the result in (i22]l where the upper bound on the exponent associated to 
the event C„ ,5 is independent of the parameters of P. 

The order of the risk, or equivalently the rate of convergence of the estimated distri- 
bution to the forest projection, is almost linear in the number of variables d and inversely 
proportional to n. We provide three intuitive reasons to explain why this is plausible: (i) 
the dimension of the sufficient statistics in a tree-structured graphical model is of order 
0{d) (ii) the ML estimator of the natural paramete r s of a n exponential family converge to 
their true values at the rate of Op{n~^''^) ( Serfiingi . llQSOl . Sec. 4.2.2) (iii) locally, the KL- 



divergence behaves l i ke th e square of a weighted Euclidean norm of the natural parameters 
(jCover and Thomas! . [JnO^ . Eq. (11.320)). 



We now compare Theorem 1101 to the corresponding results in lLiu et al.l ( 2010l ). In these 



recent papers, it was shown that by modeling the bivariate densities Pjj- as functions from a 
Holder class with exponent a > and using a reconstruction algorithm based on validation 
on a held-out dataset, the risk decays at a rateo of Op((in~°' ^""^"^^"^ ) , which is slower than the 
order of risk consistency in ()24p . This is due to the need to compute the bivariate densities 
via kernel density estimation. Furthermore, we model the pairwise joint distributions as 
discrete probability mass functions and not continuous probability density functions, hence 
there is no dependence on Holder exponents. 

7. Numerical Results 

In this section, we perform numerical simulations on synthetic and real datasets to study 
the effect of a finite number of samples on the probability of the event A,n defined in ([5]). 
Recall that this is the error event associated to an incorrect learned structure. 

7.1 Synthetic Datasets 

In order to compare our estimate to the ground truth graph, we learn the structure of 
distributions that are Markov on the forest shown in Figure [2j Thus, a subgraph (nodes 
1, . . . , fc + 1) is a (connected) star while nodes A: + 2, . . . , d — 1 are not connected to the star. 
Each random variable Xj takes on values from a binary alphabet X = {0, 1}. Furthermore, 



10. The Op(-) notation suppresses the dependence on factors involving logarithms. 

16 



Learning High-Dimensional Markov Forest Distributions 



Xk+2 Xk^ 



Xd 




Figure 2: The forest-structured distribution Markov on d nodes and k edges. 
Xfc+i, . . . , Xd are not connected to the main star graph. 
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Number of samples n 



Figure 3: The error probabihty of structure learning for (3 € (0, 1) 



Pj{xj) = 0.5 for Xj = 0, 1 and all j G V. The conditional distributions are governed by the 



"binary symmetric channel": 



Pj\l{Xj\xi) 



0.7 X 



3 



Xi 



0.3 Xj / xi 



for j = 2, . . . , k+1. We further assume that the regularization sequence is given by e„ := ^"'^ 
for some /3 E (0, 1). Recall that this sequence satisfies the conditions in ([7]). We will vary f3 
in our experiments to observe its effect on the overestimation and underestimation errors. 
In Figure O we show the simulated error probability as a function of the sample size 
n for a d = 101 node graph (as in Figure [2]) with A; = 50 edges. The error probability is 
estimated based on 30,000 independent runs of CLThres (over different datasets x"). We 
observe that the error probability is minimized when /3 ~ 0.625. Figure|4]show the simulated 
overestimation and underestimation errors for this experiment. We see that as /3 — > 0, the 
overestimation (resp. underestimation) error is likely to be small (resp. large) because the 
regularization sequence e„ is large. When the number of samples is relatively small as in 
this experiment, both types of errors contribute significantly to the overall error probability. 
When /3 ~ 0.625, we have the best tradeoff between overestimation and underestimation 
for this particular experimental setting. 
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Figure 4: The overestiniation and underestimation errors for /? G (0, 1). 



Even though we mentioned that (3 in (llip should be chosen to be close to zero so 
that the error probability of structure learning decays as rapidly as possible, this example 
demonstrates that when given a finite number of samples, /3 should be chosen to balance 
the overestimation and underestimation errors. This does not violate Theorem [3] since 
Theorem [3] is an asymptotic result and refers to the typical way an error occurs in the 
limit as n — 7> oo. Indeed, when the number of samples is very large, it is shown that the 
overestimation error dominates the overall probability of error and so one should choose /3 
to be close to zero. The question of how best to select optimal /3 when given only a finite 
number of samples appears to be a challenging one. We use cross-validation as a proxy to 
select this parameter for the real-world datasets in the next section. 

In Figure [SI we fix the value of /3 at 0.625 and plot the KL-divergence D[P\\P*) as 
a function of the number of samples. This is done for a forest-structured distribution P 
whose graph is shown in Figure [2] and with d = 21 nodes and A; = 10 edges. The mean, 
minimum and maximum KL-divergences are computed based on 50 independent runs of 
CLThres. We see that logD(P||P*) decays linearly. Furthermore, the slope of the mean 
curve is approximately —1, which is in agreement with (I24p . This experiment shows that if 
we want to reduce the KL-divergence between the estimated and true models by a constant 
factor j4 > 0, we need to increase the number of samples by roughly the same factor A. 



7.2 Real datasets 

We now demonstrate how well forests-structured distributions can model two real datasets^ 



which are obtained from the UCI Machine Learning Repository ( Newman et al.l . ll998l ). The 



first dataset we used is known as the SPECT Heart dataset, which describes diagnosing of 
cardiac Single Proton Emission Computed Tomography (SPECT) images on normal and 
abnormal patients. The dataset contains d = 22 binary variables and n = 80 training 
samples. There are also 183 test samples. We learned a forest-structured distributions 
using the 80 training samples for different /3 € (0, 1) and subsequently computed the log- 
likelihood of both the training and test samples. The results are displayed in Figure [6l We 



11. These datasets are typically employed for binary classification but we use them for modeling purposes. 
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Figure 5: Mean, minimum and maximum (across 50 different runs) of the KL-divergence 
between the estimated model P* and the true model P for a d = 21 node graph 
with k = 10 edges. 
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Figure 6: Log-likelihood scores on the SPECT dataset 



observe that, as expected, the log-likelihood of the training samples increases monotonically 

with f3. This is because there are more edges in the model when /3 is large improving the 

modeling ability. However, we observe that there is overfitting when /3 is large as evidenced 

by the decrease in the log-likelihood of the 183 test samples. The optimal value of /3 in 

terms of the log-likelihood for this dataset is ~ 0.25, but surprisingly an approximation with 
11 



an empty grapH I also yields a high log-likelihood score on the test samples. This implies 
that according to the available data, the variables are nearly independent. The forest graph 
for /3 = 0.25 is shown in Figure [7|^ a) and is very sparse. 



12. When /3 = we have an empty graph because all empirical mutual information quantities in this 
experiment are smaller than 1. 
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(a) 



(b) 



Figure 7: Learned forest graph of the (a) SPECT dataset for (3 = 0.25 and (b) HEART 
dataset for /3 = 0.53. Bold edges denote higher mutual information values. The 
features names are not provided for the SPECT dataset. 



The second dataset we used is the Statlog Heart dataset containing physiological mea- 
surements of subjects with and without heart disease. There are 270 subjects and d = 13 
discrete and continuous attributes, such as gender and resting blood pressure. We quantized 
the continuous attributes into two bins. Those measurements that are above the mean are 
encoded as 1 and those below the mean as 0. Since the raw dataset is not partitioned into 
training and test sets, we learned forest-structured models based on a randomly chosen set 
of n = 230 training samples and then computed the log-likelihood of these training and 40 
remaining test samples. We then chose an additional 49 randomly partitioned training and 
test sets and performed the same learning task and computation of log-likelihood scores. 
The mean of the log-likelihood scores over these 50 runs is shown in Figure [H We observe 
that the log- likelihood on the test set is maximized at (3 ~ 0.53 and the tree approximation 
(/3 ~ 1) also yields a high likelihood score. The forest learned when /3 = 0.53 is shown 
in Figure [Tl^b). Observe that two nodes (ECG and Cholesterol) are disconnected from the 
main graph because their mutual information values with other variables are below the 
threshold. In contrast, HeartDisease, the label for this dataset, has the highest degree, i.e., 
it influences and is influenced by many other covariates. The strengths of the interactions 
between HeartDisease and its neighbors are also strong as evidenced by the bold edges. 

From these experiments, we observe that some datasets can be modeled well as proper 
forests with very few edges while others are better modeled as distributions that are almost 
tree-structured (see Figure [7]). Also, we need to choose /3 carefully to balance between data 
fidelity and overfitting. In contrast, our asymptotic result in Theorem [3] says that En should 
be chosen according to ([7]) so that we have structural consistency. When the number of data 
points n is large, f3 in ([TT]) should be chosen to be small to ensure that the learned edge set 
is equal to the true one (assuming the underlying model is a forest) with high probability 
as the overestimation error dominates. 
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Figure 8: Log-likelihood scores on the HEART dataset 



8. Conclusion 

In this paper, we proposed an efficient algorithm CLThres for learning the parameters and 
the structure of forest-structured graphical models. We showed that the asymptotic error 
rates associated to structure learning are nearly optimal. We also provided the rate at 
which the error probability of structure learning tends to zero and the order of the risk 
consistency. One natural question that arises from our analyses is whether /3 in (|lip can 
be selected automatically in the finite-sample regime. There are many other open problems 
that could possibly leverage on the proof techniques employed here. For example, we 
are currently interested to analyze the learning of general graphical models using similar 
thresholding- like techniques on the empirical correlation coefficients. The analyses could 
potentially leverage on the use of the method of types. We are currently exploring this 
promising line of research. 
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Appendix A. Proof of Proposition [2] 

Proof (Sketch) The proof of this result hinges on the fact that both the overestimation 
and underestimation errors decay to zero exponentially fast when the threshold is chosen to 
be /min/2. This threshold is able to differentiate between true edges (with MI larger than 
-^min) from non-edges (with MI smaller than /min) with high probability for n sufficiently 
l arge. The error for learning the top k edges of the forest also decays exponentially fast 

|6]) holds. The full details of the proof follow in a straightforward 
I which we present next. ■ 



dTaneLaLl, 120111). Thus, 



manner from Appendix [ 
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Appendix B. Proof of Theorem [3] 

Define the event &„ '■= {-Efc ¥" Ep}, where E'fc = {ei, • • • ,6^} is the set of top k edges (see 
Step [3] of CLThres for notation). This is the Chow-Liu error as mentioned in Section [4.31 
Let B^ denote the complement of Bn- Note that in B^, the estimated edge set depends 
on k, the true model order, which is a-priori unknown to the learner. Further define the 
constant 

Kp:= hm -- log P^{Bn). (25) 

71— >oo n 

In other words, Kp is the error exponent for learning the forest structure incorrectly assum- 
ing the true model order k is know n and Chow-Li u terminates after the addition of exactly 
k edges in the MWST procedure (JKruskall . Il956l ) . The existence of the limit in ([25]) and 



the positivity of Kp follow from the main results in lTan et al.l ( 201ll ). 

We first state a result which relies on the Gallager-Fano bound ( Fanol . Il96ll . pp. 24). 
The proof will be provided at the end of this appendix. 

Lemma 11 (Reduction to Model Order Estimation) For every rj G {0,Kp), there 
exists a A^ G N sufficiently large such that for every n > N , the error probability P"'{An) 
satisfies 

{l-r])P^(kn^k\B'J<P''{An) (26) 

< P^(kn + k\Bl) + 2exp(-n(Kp - r?)). (27) 

Proof (o/ Theorem\^ We will prove (i) the upper bound in ([9]) (ii) the lower bound in ([8]) 
and (iii) the exponential rate of decay in the case of trees (jlOp . 

Proof of upper bound in Theorem [3] 

We now bound the error probability P^{kn ^ k\B'f^) in (|27|) . Using the union bound, 

P^ikn + k\B'J < P"(fc„ > k\B'J + P'^ikn < k\B'J. (28) 

The first and second terms are known as the overestimation and underestimation errors 
respectively. We will show that the underestimation error decays exponentially fast. The 
overestimation error decays only subexponentially fast and so its rate of decay dominates 
the overall rate of decay of the error probability for structure learning. 

Underestimation Error 

We now bound these terms staring with the underestimation error. By the union bound, 

P''{K<k\B'J<{k-l) max P^(kn=j\B'J 

l<j<k—l 

= {k- l)P^iK = k- l\Bl), (29) 

where (I29p follows because -P"(/cn = jWn) i^ maximized when j = k — 1. This is because if, 
to the contrary, P^{kn = j\B'^) were to be maximized at some other j < k — 2, then there 
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exists at least two edges, call them ei,e2 € Ep such that events Si := {/(Pei) < ^n} and 
£2 '■= {^(-^62) ^ £ra} occur. The probability of this joint event is smaller than the individual 
probabilities, i.e., P"{£i fl £"2) < min{P"(£'i), ^"(£"2)}. This is a contradiction. 
By the rule for choosing /c„ in (JH), we have the upper bound 

P"(fc„ = k- 1\B^) = P"(3e E Ep s.t. I{Pe) < En) < A;maxP"(/(Pe) < ^n), (30) 

eeEp 

where the inequality follows from the union bound. Now, note that if e € Ep, then I(Pp) > 
En fo r n sufficiently large (since e„ — >■ 0). Thus, by Sanov's theorem ( Cover and Thomad . 



20061, Ch. 11), P"(/(Pe) < £«) can be upper bounded as 

P"(/(Pe) < en) <{n + ly' exp ( -n min {D{Q \\Pe): I{Q) < SnlV (31) 



Define the good rate function (JDembo and Zeitounil . Il998l ) in ([3T|) to be L : ViX ) x 

[0, 00) — 7> [0, 00), which is given by 

L(Pe;a):= min {D(Q\\Pe):I(Q)<a}. (32) 

Clearly, L{Pe;a) is continuous in a. Furthermore it is monotonically decreasing in a for 
fixed Pe- Thus by using the continuity of L{Pe; •) we can assert: To every ry > 0, there 
exists a A^ € N such that for all n > A^ we have L{P(>;en) > L{Pe;0) — r]. As such, we can 
further upper bound the error probability in (f3T]l as 



P"(/(Pe) < Sn) < (n + 1)^ exp (-n(L(Pe; 0) - 7?)) . (33) 

By using the fact that /min > 0, the exponent L{Pe;0) > and thus, we can put the pieces 
in (j29|) . pO]) and ([33|) together to show that the underestimation error is upper bounded as 



P"(A;„ < k\B^^) < k{k - l)(n + 1)''' exp ( -n min (L(Pe; 0) - ??) ) . (34) 

e£Ep 

Hence, if k is constant, the underestimation error P^{kn < k\B'f^) decays to zero exponen- 
tially fast as n — 7> 00, i.e, the normalized logarithm of the underestimation error can be 
bounded as 

limsup - logP"(A?„ < k\B^) < - min (L(Pe; 0) - ??). 

n-).oo n eeEp 

The above statement is now independent of n. Hence, we can take the limit as r/ ^- to 
conclude that: 

limsup - log P"(A;„ <A;|S^) < -Lp. (35) 

The exponent Lp := mineg^p L(Pe;0) is positive because we assumed that the model is 
minimal and so /min > 0, which ensures the positivity of the rate function L{Pe] 0) for each 
true edge e G Ep. 
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SQ:I{Q) = en,} 

Pe : /(Pe) = 




_ "2 

decreasing e„ 

{Q:I{Q)=em? 

Figure 9: As e„ — ^ 0, the projection of Pg onto the constraint set {Q : I{Q) > £n}, denoted 
Q* (the optimizer in (j39|) ). approaches Pg- The approximations in (jl0|) and (j^T]) 
become increasingly accurate as e„ tends to zero. In the figure, n2 > ni and 
^ni > £n2 £^i^d the curves are the (sub-)manifold of distributions such that the 
mutual information is constant, i.e., the mutual information level sets. 



OVERESTIMATION ERROR 

Bounding the overestimation error is harder. It follows by first applying the union bound: 



P'^ikn > k\B^) <{d-k-l) max P"(A;n = j\13: 

k+l<j<d—l 



= {d-k-l)P'\kn = k + l\Bl), (36) 

where (I36p follows because P"'{kn = j\B!f^) is maximized when j = k + 1 (by the same 
argument as for the underestimation error). Apply the union bound again, we have 

P''(kn = k + l\B^)<{d-k-l) max P"(/(Pe) > e„). (37) 

e€VxV:I{Pe)=0 

From (|37p , it suffices to bound P^{I{Pe) > £n) for any pair of independent random variables 
(Xj^Xj) and e = { i .,j). We proceed by applying the upper bound in Sanov's theorem 
dCover and Thomal IJOOJ . Ch. 11) to P'^iliPe) > £n) which yields 



P^'iliPe) > £n) <{n + ly exp -n min {D{Q 1 1 Pe) : I{Q) > En} ] , (38) 



for all n € N. Our task now is to lower bound the good rate function in (j38p . which we 
denote as M : V{X'^) x [0, oo) -^ [0, oo): 

M(Pe;b):= min {D(Q\\Pe):I(Q)>b}. (39) 

Note that M^P^; b) is monotonically increasing and continuous in b for fixed Pg. Because the 
sequence {en}nGN tends to zero, when n is suf ficiently large, £„ is arbitrarily small and w e 
are in the so-called very-noisy learning regime ( Borade and Zhengj . 120081 : iTan et al.l , l201ll ). 



where the optimizer to (f39l) . denoted as Q"^, is very close to Pg. See Figure [H Thus, when 
n is large, the KL-divergence and mutual information can be approximated as 

^(Q;il^e) = VngV + o(||v||2), (40) 

/(Q:) = VH,v + o(||vf), (41) 
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whera^i v := vec(Q* ) — vec(Pe) £ ^^ ■ The r^ x r^ matrices lie and He are defined as 

ne:=diag(l/vec(Pe)), (42) 

He:=VL(Q)/(vec(Q))|Q^^^. (43) 

In other words, lie is the diagonal matrix that contains the reciprocal of the elements of 
vec(Pe) on its diagonal. He is the Hessian^i of /(vec((5* )), viewed as a function of vec{Q'^) 
and evaluated at Pe- As such, the exponent for overestimation in ()39p can be approximated 
by a quadratically constrained quadratic program (QCQP), where z := vec(Q) — vec(Pe): 

M{Pe;en)= min -z'^HeZ, 
subject to -z^HeZ > e„, z^l = 0. (44) 

Note that the constraint z^l = does not necessarily ensure that Q is a probability 
distribution so M(Pe;£n) is a lower bound to the true rate function M{Pe;£n), defined 
in (j39p . We now argue that the approximate rate function M in ()44p . can be lower bounded 
by a quantity th at is proportional to £„. To show this, we resort to Lagrangian duality 
( Bertsekasl . Il999l . Ch. 5). It can easily be shown that the Lagrangian dual corresponding to 



the primal in (I44p is 



g{Pe;en) ■= £n max{/i : He >z fiUe}. (45) 

fi>0 



We see from ()15]) that g{Pe;en) is proportional to e„. By weak duality ( Bertsekasl . Il999l . 



Proposition 5.1.3), any dual feasible solution provides a lower bound to the primal, i.e., 

giPe;en)<MiPe;en). (46) 

Note that strong duality (equality in (j46]l ) does not hold in general due in part to the 
non-convex constraint set in (I44p . Interestingly, ou r manipulations lead lower boun ding M 
by (j45p , which is a (convex) semidefinite program rVand enberghe and BovdI . 1 19961 ) . 



Now observe that the approximations in (j40p and (I4ip are accurate in the limit of large 
n because the optimizing distribution Q* becomes increasingly close to Pe- By continuity of 
the optimization problems in (perturbations of) the objective and the constraints, M{Pe;en) 
and M{Pe;en) are close when n is large, i.e.. 



lim 

n— >oo 



M(Pe;en)-M(Pe;en) =0. (47) 



By applying the continuity statement above to (j38p . for every ?/ > 0, there exists a A^ € N 
such that 

P^{I{Pe) > Sn) <{n + if exp (-n{M{Pe; e„) - r?)' 



13. The operator vec(C) vectorizes a matrix in a column oriented way. Thus, il C G R'^', vec(C) is a 
length-/^ vector with the columns of C stacked one on top of another (C(:) in Matlab). 

14. The first two terms in the Taylor expansion of the mutual information I{vec{Qn)) in l|41[l vanish because 
(i) /(Pe) = and (ii) (vec((5*)— vec(Pe))^Vvcc(Q)/(vec(Pe)) = 0. Indeed, if we expand J (vec(Q)) around 
a product distribution, the constant and linear terms vanish (jBorade and Zhenej . 120081 ). Note that He 
in (|43|l is an indefinite matrix because I{vec{Q)) is not convex. 
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for all n > N. Define the constant 

cp := min max {/i : lip >~ uHpj. (48) 

By (gg), (06]) and the definition of cp in (08]), 

P"(/(^e) > en) <{n + ly' exp {-nen{cp - rj)) . (49) 

Putting ([5S|) . (f57|) and ([M]) together, we see that the overestimation error 

P"(:fc„ > k\Bl) <{d-k-lf{n + ly^ exp {-nen{cp - i])) . (50) 

Note that the above probability tends to zero by the assumption that ne„/logn — )• oo 
in d?]). Thus, we have consistency overall (since the underestimation, Chow-Liu and now 
the overestimation errors all tend to zero). Thus, by taking the normalized logarithm 
(normalized by ne„), the limsup in n (keeping in mind that d and k are constant), we 
conclude that 

limsup — log P"(A;„ > A:|S^) < -(cp -r?). (51) 

n— >oo fl^n 

If we now allow rj in (j5ip to tend to 0, we see that it remains to prove that cp = 1 for all P. 
For this purpose, it suffices to show that the optimal solution to the optimization problem 
in (flSl) . denoted /i*, is equal to one for all lie and Hg. Note that ji* can be expressed in 
terms of eigenvalues: 

/.* = (max{eig(n,"i/2Hen-i/2)})"' , (52) 

where eig(A) denotes the set of real eigenvalues of the symmetric matrix A. By using 
the definitions of He and Hg in (j42p and (j43p respectively, we can verify that the matrix 



I — He Hglle is positive semidefinite with an eigenvalue at zero. This proves that the 
largest eigenvalue of He Helle is one and hence from (j52p . //* = !. The proof of the 



upper bound in ([9]) is completed by combining the estimates in (i27|l . ([351) and (i5T]l . 

Proof of lower bound in Theorem [3] 

The key idea is to bound the overestimation error using a modification of the lower bound 
in Sanov's theorem. Denote the set of types supported on a finite set y with denominator 
n as Vn {y) and the type class of a distribution Q G Vn (y) as 

Tn{Q):={y''ey^:Pi-;yn = Q{-)}, 

where P{ ■ ; y") is the empirical distributio n of the sequence ?/" ^Ini^ ■ ■ ■ > Vn)- The following 
bounds on the type class are well known ( Cover and Thomasl . 120061 . Ch. 11). 



Lemma 12 (Probability of Type Class) For any Q € Vniy) and any distribution P, 
the probability of the type class T„(Q) under P" satisfies: 

(n + l)-l^l exp{-nD{Q \\P))< P"(T„(Q)) < exp{-nD{Q\\P)). (53) 
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To prove the lower bound in ^, assume that k < d — 1 and note that the error probabihty 
P^{kn 7^ ^Wn) can be lower bounded by P^{I{Pe) > £n) for any node pair e such that 
I{Pe) = 0. We seek to lower bound the latter probability by appealing to (153]) . Now choose 
a sequence of distributions Q^"'' G {Q G V^i^^) '■ HQ) ^ ^n} such that 



lim 

n— foo 



M(Pe;e„)-^(Q("Ml^e 



0. 



This is possible because the set of types is dense in the probability simplex (JDembo and Zeitounil . 
19981 . Lemma 2.1.2(b)). Thus, 



P"(/(Pe) > e„: 



Y^ P"(T„(Q)) 

QeP„(A'2);/(Q)>e„ 

> P"(T„(Q("))) 



> (n + 1) 



exp( 



-nZ?(QW II Pe)), 



(54) 



where (I54p follows from the lower bound in (I53p . By applying (I47p . and using the fact that 
if |an — 6ra| — ^ and |6n — c„| — )• then, |an — c„| ^ (triangle inequality), we also have 



lim 



M{Pe;en)-D{Q^^^\\Pe 



0. 



Hence, continuing the chain in 

n> N, 



for any 77 > 0, there exists a A^ G N such that for all 



^"(/(Pe) > en) > (n + 1)-^ exp(-n(M(Pe; £«) + r/)). 



(55) 



Note that an upper bound for M{Pe]£n) in (|44l) is simply given by the objective evaluated 
at any feasible point. In fact, by manipulating (|li|) . we see that the upper bound is also 
proportional to e^, i.e., 

M{Pe;en)<Cp^en, 

where Cp^ G (0, 00) is some constant!^ that depends on the matrices lie and Hg. Define 
Cp := maXggyxV:/(Pe)=o C'Pj. Continuing the lower bound in ([55]) . we obtain 

P''{I{Pe) > Sn) >{n + 1)-'^' exp(-ne„(Cp + r/)), 
for n sufficiently large. Now take the normalized logarithm and the lim inf to conclude that 



liminf — logP'^(:fc„ / k\B^) > -{Cp + r?). 



(56) 



Substitute (j56p into the lower bound in (j26p . Now the resulting inequality is independent 
of n and we can take ry — )• to complete the proof of the lower bound in Theorem [3l 



15. We can easily remove the constraint z 1 in (|44|l by a simple change of variables to only consider those 
vectors in the subspace orthogonal to the all ones vector so we ignore it here for simplicity. To obtain 
Cp^, suppose the matrix We diagonalizes He, i.e., He = WJTDeWe, then one can, for example, choose 

Cp^ = min^.p^]^ .>o[W^neWe]j,i. 
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Proof of the exponential rate of decay for trees in Theorem [3] 

For the claim in (jlOp . note that for n sufficiently large, 

P-(A) > max{(l - r?)P"(fc„ / fc„|S^),P"(S„)}, (57) 

from Lemma [TT] and the fact that Bn C An- Eq. (|57p gives us a lower bound on the error 
probability in terms of the Chow-Liu error P'^{Bn) and the underestimation and overes- 
timation errors P'^{kn 7^ kn\B!f^). If k = d — 1, the overestimation error probability is 
identically zero, so we only have to be concerned with the underestimation error. Further- 
more, from (j35|) and a corresponding lower bound which we omit, the underestimation error 
event satisfies P'^(kn < k\B'^) = exp(— nLp). Combining this fact with the definition of the 
error exponent Kp in (j25|) and the result in (|57p establishes (|lUp . Note that the relation 
in (j57p and our preceding upper bounds ensure that the limit in (jlOp exists. ■ 



Proof ( of Lemma FTTI) We note that P^''{An\kn 7^ k) = 1 and thus, 

P"(A) < P'\K ^k) + P"(A|^n = k). (58) 

By using the definition of Kp in (j25]) . the second term in (f58|) is precisely P^{Bn) therefore, 

P'^iAn) < P'^iK ^k) + exp{-n{Kp - rj)), (59) 

for all n > Ni. We further bound P^{kn ^ k) by conditioning on the event 0^. Thus, for 
r/>0, 

P"(fen ^ A:) < P^ikn + k\Bl) + P"(^„) 

< P"(:fc„ / A:|^^) + exp(-n(ii-p - r?)), (60) 

for all n > N2. The upper bound result follows by combining (|59p and (|6Up . The lower 
bound follows by the chain 

P"(A) > P'^ikn ^k)> P'^iikn ^k}r\ Bl) 

= P^{kn + k\B'^)P^{B';,) > (1 - r?)P"(fc„ / A;|S^), 

which holds for all n > N3 since P"'{B^) — > 1. Now the claims in (|26p and (|27p follow by 
taking N := max{7Vi, iV2, N3}. ■ 



Appendix C. Proof of Corollary [4] 

Proof This claim follows from the fact that three errors (i) Chow-Liu error (ii) underesti- 
mation error and (iii) overestimation error behave in exactly the same way as in Theorem[3l 
In particular, the Chow-Liu error, i.e., the error for the learning the top k edges in the forest 
projection model P decays with error exponent Kp. The underestimation error behaves as 
in (1351) and the overestimation error as in (1511). ■ 
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Appendix D. Proof of Theorem [5] 

Proof Given assumptions (Al) and (A2), we claim that the underestimation exponent 
Lp(d), defined in (1351) . is uniformly bounded away from zero, i.e., 

L := inf Lp(d) = inf min L(P^'^^;0) (61) 

deN ^ deNeeEp^a) 

is positive. Before providing a formal proof, we provide a plausible argument to show that 
this claim is true. Recall the definition of L{Pe;0) in (]32p . Assuming that the joint Pe = Pij 
is close to a product distribution or equivalently if its mutual information I{Pe) is small 
(which is the worst-case scenario), 

L{Pe;0)^ min {D{Pe\\Q) : I{Q) = 0} (62) 

= D{Pe\\Pi Pj) = I{Pe) > /inf > 0, (63) 

where in (|62p . the arguments in the KL-divergence have been swapped. This is because 
when Q K, P^ entry- wise, -D((5 || -P r) ^ DJPe. II Q) in the seii se that their difference is small 
compared to their absolute values (JBorade and Zheng) . 120081 ) . In (|63l) , we used the fact that 



the reverse I-projection of Pe onto the set of product distributions is PiPj- Since Ijnf is 
constant, this proves the claim, i.e., L > 0. 
More formally, let 

Bk,' ■= {Qi,j G ViX"^) : Qi,jixi,Xj) > K.',yxi,Xj G X} 

be the set of joint distributions whose entries are bounded away from zero by k' > 0. 
Now, consider a pair of joint distributions Pe ,Pe G -Br' whose minimum values are 
uniformly bounded away from zero as assumed in (A2). Then there exists a Lipschitz 
constant (independent of d) C/ € (0, cxo) such that for all d, 

|/(Pi'^)) - /(PW)| < C/||vec(pW) - vec(pW)||i, (64) 

where || • ||i is the vector ii norm. In fact, U := maxQg^ , ||V/(vec((5))||oo is the Lipschitz 
constant of /(•) which is uniformly bounded because the joint distributions Pe and Pe 
are assumed to be uniformly bounded away from zero. Suppose, to the contrary, L = 0. 
Then by the definition of the infimum in ()6ip . for every e > 0, there exists a d G N and a 
corresponding e G Ep{d) such that if Q* is the optimizer in (p2]) . 

. ^ n(n* II PWI 5 ||vec(Pi'^Vvec(Q*)||f W IIJP^"^) - I{Q*)\^ W ij, 

^^ " ^ '- 21og2 - (21og2)C/2 - (21og2)[/2' 



where (a) follows from Pinsker's inequality ([Cover and Thomasl . l2006l . Lemma 11.6.1), (b) 



is an application of ()64p and the fact that if Pe G 5^ is uniformly bounded from zero (as 
assumed in (fTCj) ) so is the associated optimizer Q* (i.e., in B^^ii for some possibly different 
uniform k" > 0). Statement (c) follows from the definition of /jnf and the fact that Q* is 
a product distribution, i.e., I{Q*) = 0. Since e can be chosen to be arbitrarily small, we 
arrive at a contradiction. Thus L in (|6ip is positive. Finally, we observe from (j34p that if 
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n > (3/L) log k the underestimation error tends to zero because p4p can be further upper 
bounded as 

P"(A;„ < k\B^) < (n + !)'■" exp(21og A; - nL) < (n + 1)''' exp ( -nL - nL J ^ 

as n — > oo. Take C2 = 3/L in (fT6|) . 

Similarly, given the same assumptions, the error exponent for structure learning Kp(d) , 
defined in ()25p . is also uniformly bounded away from zero, i.e., 

K := inf Kp^d) > 0. 
den ^ 

Thus, if n > {^/K) log d, the error probability associated to estimating the top k edges 
(event Bn) decays to zero along similar lines as in the case of the underestimation error. 
Take Ci = 4/i^ in (fTUjl . 

Finally, from (|50p . if ne„ > 2\og{d — k), then the overestimation error tends to zero. 
Since from ([7]), En can take the form n~^ for /3 > 0, this is equivalent to n^~f^ > 2log{d — k), 
which is the same as the first condition in (J16p . namely n > (21og((i — A;))^^^. By ()27p 
and (j28p . these three probabilities constitute the overall error probability when learning 
the sequence of forest structures {Ep(d)}deN- Thus the conditions in ([T6|) suffice for high- 
dimensional consistency. ■ 



Appendix E. Proof of Corollary [6] 

Proof First note that /c„ € {0, . . . ,dn — 1}- From ()50p . we see that for n sufficiently large, 
the sequence hn{P) := (ne„)^^ logP"(^„) is upper bounded by 

2 , ,, , ^ r^logfn+l) , , 

_1 + iog(d„ -kn-l) + ^^ ^ (65) 

The last term in (f65]l tends to zero by ([7]). Thus hn{P) = 0((ne„)~^ log(d„ — A:„ — 1)), where 
the implied constant is 2 by (j65p . Clearly, this sequence is maximized (resp. minimized) 
when kn = (resp. kn = dn — 1)- Eq. ()65p also shows that the sequence /i„ is monotonically 
decreasing in kn- ■ 



Appendix F. Proof of Theorem [7] 

Proof We first focus on part (a). Part (b) follows in a relatively straightforward manner. 
Define 

Tmap(x") := argmax P(Tp = t|x") 

to be the maximum a-posteriori (MAP) decoding rule|l£| By the optimality of the MAP 
rule, this lower bounds the error probability of any other estimator. Let W := Tmap{{^ )^) 



16. In fact, this proof works for any decoding rule, and not just the MAP rule. We focus on the MAP rule 
for concreteness. 
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be the range of the function Tmap, i-e., a forest t G W if and only if there exists a sequence 
x" such that Tmap = t. Note that W U W^ = T^. Then, consider the lower bounds: 

P(f ^Tp)= Y^ P(f / Tp\Tp = t)F{Tp = t) 

P(f ^Tp\Tp = t)¥{Tp = t) 

\Tp = t) 



i^%^ 



> 



tew 



(66) 

(67) 

(68) 

where in ([MD, we used the fact that P(r / Tp jTp = t) = 1 if t e W^ in §7^, the fact 
that P(Tp = t) = l/\T^\. In dMD, we used the observation |>V| < (!;f''|)'" = r"'^ since the 
function Tmap '■ (A"^)" — » W is surj e ctive. Now, the number of labeled forests with k edges 
and d nodes is (jAigner and Zieglerl . |2009| . pp. 204) \T^\ > {d — k)d''~^ > d^~^ . Applying 



y, HTp =t) = i- 


->: 

iGW 


1 - y. \Ti\-^ 

tGW 

1 „nd\'-i-d\-l 
^-r \lk\ 1 





this lower bound to (j68p . we obtain 

P(f/ Tp) > l-exp(ndlogr- (A;-l)logfi) > 1 - exp ((£» - 1)(A; - 1) logd) 



(69) 



where the second inequality follows by choice of n in (jlSp . The estimate in ()69p converges 
to 1 as {k, d) — )• oo since ^ < 1. The same reasoning applies to part (b) but we i i istead use 
the following estimates of the cardinality of the set of forests ( Aigner and Zieglerl . l2009l . Ch. 
30): 

{d - 2) log d < log \F'^\ <{d-l) \og{d + 1). (70) 

Note that we have l ower bounded \J-'^\ by the number trees with d nodes which is d'^~'^ 
by Cayley's formula (JAigner and Zieglerl . l2009l . Ch. 30). The upper bound^l follows by a 
simple combinatorial argument which is omitted. Using the lower bound in (|70p . we have 



P(f/ Tp) > l-exp(ndlogr)exp(-((i-2)logd) > 1 - d2exp((^ - l)dlogd), (71) 
with the choice of n in (I19p . The estimate in (I7ip converges to 1, completing the proof. ■ 



Appendix G. Proof of Theorem [8] 

Proof We assume that P is Markov on a forest since the extension to non-forest-structured 
P is a straightforward generalization. We start with some useful definitions. Recall from 
Appendix [Bl that Bn '■= {E^ / Ep} is the event that the top k edges (in terms of mutual 
information) in the edge set E^-i are not equal to the edges in Ep. Also define Cn,5 '■= 
{D(P* II P) > 5d} to be the event that the divergence between the learned model and the 
true (forest) one is greater than 5d. We will see that Cn,5 is closely related to the event of 



17. The purpose of the upper bound is to show that our estimates of | J^'*] in (|70|) are reasonably tight 
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E-r 



6 



Ep 



6. 



Figure 10: In E-r (left), nodes 1 and 5 are the roots, which are in blue. The parents are 

7r(i;^r ) = 0for i = 1,5. In 
i - 1 for i = 2, 3, 4 but 7r(z; Ep) 



defined as vrfz: Et ) = i — 1 for z = 2, 3, 4, 6 and 7r(i: Et ) = for i = 1, 5. In Ep 



(right), the parents are defined as 7r(z; Ep) - 
for i = 1, 5, 6 since (5, 6), (0, 1), (0, 5) ^ Ep. 



interest Cn,s defined in ([2T]) . Let Un := {kn < k} be the underestimation event. Our proof 
relies on the following result, which is similar to Lemma [TTJ hence its proof is omitted. 

Lemma 13 For every r] > 0, there exists a iV G N such that for all n > N , the following 
hounds on P"'{Cn,s) hold: 



{l-r^)P^{CnABl,K)<P''{Cn,6) 

< P''{Cn,s\B'n,K) + exp{-n{mm{Kp, Lp} - r?)). 



(72) 
(73) 

Note that the exponential term in (j73p comes from an application of the union bound and 
the "largest-exponent-wins" principle in large-deviations theory. From (j72p and (|73p we see 
that it is possible to bound the probability of Cn,s by providing upper and lower bounds 
for P"(C„_5|yBJ^,W^). In particular, we show that the upper bound equals ex.p{—n6) to first 
order in the exponent. This will lead directly to (I22p . T o proceed, we rely on the following 
lemma, which is a generalization of a well-known result ( Cover and Thomad . l2006l . Ch. 11). 
We defer the proof to the end of the section. 

Lemma 14 (Empirical Divergence Bounds) Let X, Y be two random variables whose 
joint distribution is Px,Y &'P{X'^) and\X\ = r. Let {x"^,!/^) = {{xi,yi), . . . ,{xn,yn)} ben 
independent and identically distributed observations drawn from PxY- Then, for every n, 



Px,y{D{Px\y\\Px\y)>S) < (n + 1)^ exp(-n5), 
where Px\y = Px,y/Py is the conditional type of{x'^,y"'). Furthermore, 

limmf -log PIy{D{Pxiy\\Px\y)>S) > -5. 



n— >oo n 



(74) 



(75) 



It is worth noting that the bounds in (|74p and (j75p are independent of the distribution Px,y 
(cf. discussion after Theorem [8]). We now proceed with the proo f of Theorem [HI To do so, 
we consider the directed representation of a tree distribution Q (jLauritzenl . Il996l ) : 



<5(X) = YlQi\7T(i){Xi\x^(i)] 



(76) 



iev 



where 7r(i) is the parent of i in the edge set of Q (assuming a fixed root). Using (I76p and 
conditioned on the fact that the top k edges of the graph of P* are the same as those in Ep 
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(event yB^) and underestimation does not occur (event Z^^), the KL-divergence between P* 
(which is a function of the samples x" and hence of n) and P can be expressed as a sum 
over d terms: 

D(P*||P) = 5^D(F^^^^ ||P,|.(,^,)), (77) 

i&V 

where the parent of node i in E-r , denoted TT(i;E7: ), is defined by arbitrarily choosing a 

root in each component tree of the forest Tt = (V,Et ). The parents of the chosen roots 

are empty sets. The parent of node i in Ep are "matched" to those in E-r , i.e., defined as 

Tr(i;Ep) := -Ki^Er ) if (i,iT(i;E'r )) £ Ep and TT(i;Ep) := otherwise. See Figure [TOl for 

an example. Note that this can be done because E-r ^ Ep by conditioning on the events 

S^ and U^ = {kn > k}. Then, the error probability P'^{Cn,5\B'^,l^n) ^^ dZSl) can be upper 
bounded as 

P^{Cn,5\B'n,K) = ^" (E^(^.k(.;%jll^*K(*;^p)) > '^^l^-^n ) (78) 

ViGV 

- ^" {Tv {^^^^\A^■,E,J\P^\A^■,E.))} > ^1^^,^^ (79) 

< 5^P" (^(^.|.(,g,jll^.K(.i.p)) > s\K,K) (80) 

< ^(n + ly' exp (-nJ) = d(n + l)*"' exp {-n6) , (81) 

where Eq. (j78p follows from the decomposition in (j77p . Eq. (j79|) follows from the fact that if 
the arithmetic mean of d positive numbers exceeds d, then the maximum exceeds 5. Eq. ()80p 
follows from the union bound. Eq. (j8ip . which holds for all n € N, follows from the upper 
bound in ()74p . Combining (I73p and ()8ip shows that if 5 < min{i^p,Lp}, 

limsup-logP"(C„,5) < -6. 

n—^oo fl 

Now recall that Cn,s = {D(P* || P) > 6d}. In order to complete the proof of (p2|) . we need 
to swap the arguments in the KL-divergence to bound the probability of the event Cn,s = 
{D{P\\P*) > 5d}. To this end, note that for e > and n sufficiently large, \D{P* \\P) - 
D{P 1 1 P*)| < e with high probability since the two KL-divergences become close (P* ~ P 
w.h.p. as n ^^ oo). More precisely, the probability of {\D{P* \\P) — D{P\\P*)\ > e} = 
{o(||P — P*||^) > e} decays exponentially with some rate Mp > 0. Hence, 

lim sup- log P"(Z)(P IIP*) >6d) < -6, (82) 

n—^oo ri 

if 5 < inm{Kp,Lp,Mp}. If P is not Markov on a forest, ()82p holds with the forest 
projection P in place of P, i.e., 

limsup- log P"(P>(P IIP*) >6d) < -6. (83) 

n— >oo n 
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The Pythagorean relationship ( Simorj . Il973l : iBach and Jordanl . l2003l ) states that 

D{P\\P*) = D{P\\P) + D{P\\P*) (84) 

which means that the risk is TZn{P*) = D{P\\P*). Combining this fact with (183p imphes 
the assertion of (p2]) by choosing 5q := minjifp, Lp, Afp}. 

Now we exploit the lower bound in Lemma [14] to prove the lower bound in Theorem [8l 
The error probability P"^ {Cn^sWnMn) i'^ (|73|) can now be lower bounded by 

(85) 
(86) 



P'-{Cn,5WnM'n) > max P" (D{P..^.J^^ . W P^.(i;E,)) > ^ 6^,^ 

> exp(— n((5d + r/)), 



where (j85p follows from the decomposition in (j78p and (j86p holds for every ry for sufficiently 
large n by (I75p . Using the same argument that allows us to swap the arguments of the 
KL-divergence as in the proof of the upper bound completes the proof of ([23P . ■ 



Proof {of Lemma\14\) Define the 6 -conditional-typical set with respect to Px,y G V{X'^) as 
Sk,Y ■■= {(^"'2/") e (-^'r : DiPxiY \\Px\y) < S}, 

where Px\y is the conditional type of {x^,y'^). We now estimate the PjJ y-probability of 
the 5-conditional-atypical set, i.e., P^yii^p Y) 

Px,y{{Sp- 



^'^x,yT) 



E ^x,y((x",2/")) 

E ^x,y(Tn(Qx,y)) 

Qx,y&Vu(X-^).D(Qx\y\\Px\y)>& 

< E exp{-nD{Qx,Y\\Px,Y)) 

Qx,YeVn{X'^y.D{QxlY\\Px\Y)>S 

< E (^w{-nD{Qx\Y\\Px\Y)) 

Qx.Y&Vn{X2y.D{QxlY\\Px\Y)>S 

< y^ exp(— nJ) 

Qx.YeVn{X'^):D{QxlY\\Px\Y)>S 
2 

< (n + l)*" exp(— n(5). 



(87) 



(90) 
(91) 
(92) 



where (|87P and (|88|) are the same because summing over sequences is equivalent to summing 
over the cor responding type classes s ince every sequence in each type class has the same 
probability ( Cover and Thomad . l2006l . Ch. 11). Eq. ()89p follows from the method of types 
result in Lemma [T2l Eq. (j90p follows from the KL-divergence version of the chain rule, 
namely, 

D{Qx,y\\Px,y) = D{Qx\y\\Px\y) + D{Qy\\Py) 

and non-negativity of the KL-divergence D{Qy \\ Py)- Eq. (I9ip follows from the fact that 
D{Qx\Y \\Px\y) > ^ for Qx,Y G {Sp^^y. Finally, ([92|) follows the fact that the number of 
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types with denominator n and alphabet X'^ is upper bounded by (n + 1)'' . This concludes 
the proof of (fTil) . 

We now prove the lower bound in (|75]l . To this end, construct a sequence of distributions 
{QxY ^ 'Pn('^^)}neN such that Qy = Py and D{Qy Lr\\Px\Y) ~^ ^- Such a sequence exists 
by the denseness of types in the probability simplex ( Dembo and Zeitounil . 119981 . Lemma 
2.1.2(b)). Now we lower bound 



PlviiS'p^^rr) > PlviJuiQ^gy)) > (n + l)-^' eM-nD{Q% \\Px,y)). (93) 
Taking the normalized logarithm and liminf in n on both sides of (193p yields 

\\ra\niUogPly{{Sl,^^r)>\iriimi{-D{Qf \\Px\y) - D{qP\\Py)] = -5. 
This concludes the proof of Lemma \TM ■ 

Appendix H. Proof of Corollary [9] 

Proof If the dimension d = o{exp{n5)), then the upper bound in (j8ip is asymptotically 
majorized by poly(n)o(exp(na)) exp(— n(5) = o(exp(n5)) exp(— niJ), which can be made ar- 
bitrarily small for n sufficiently large. Thus the probability tends to zero as ra — )• oo. ■ 

Appendix I. Proof of Theorem 1101 

Proof In this proof, we drop the superscript (d) for all distributions P for notational 
simplicity but note that d = dn- We first claim that D{P* \\ P) = Op{dlogd/n^^^). Note 
from (f73]l and (|8T]) that by taking S = {Tlogd)/n^~'^ (for any r > 0), 

^" I 4^-^^(^*II^) >^ I <d(n + l)''%xp(-rn^logd)+exp(-e(n)) = o„(l). (94) 
\ d log d J 

1 — 7 ^~' 

Therefore, the scaled sequence of random variables j\^^D{P* 1 1 -P) is stochastically bounded 
( Serfling] . ll98Cl ) which proves the claimj^^l 



Now, we claim that D{P \\ P*) = O^jdlogd/n} '^). A simple calculation using Pinsker's 
Inequality and Lemma 6.3 in lCsiszar and Talatal ( 20061 ) yields 



D{Px,y\\Pxx) <-D{Px,y\\Px,y), 

K 

where k := mirix^y Pxxi^^v) ^^^ ^ ~ 2 log 2. Using this fact, we can use ([7¥|) to show that 
for all n sufficiently large, 

PIy{D{Px\y\\Px\y)>S) < {n + lY\xp{-n6f,/c), 



18. In fact, we have in fact proven the stronger assertion that D{P* \\P) — Op{dlogd/n^ ^) since the 
right-hand-side of (|94|) converges to zero. 
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i.e., if the arguments in the KL-divergence in (j74p are swapped, then the exponent is reduced 
by a factor proportional to k. Using this fact and the assumption in (jlSp (uniformity of the 
minimum entry in the pairwise joint k > 0), we can replicate the proof of the result in ()8ip 
with 5k/ c in place of 5 giving 

P'^{D{P\\P*) >6)< d{n + 1)^' exp {-ndn/c) . 

We then arrive at a similar result to ()94p by taking 5 = (Tlogd)/n^~^ . We conclude that 
D(P II P*) = Op(d log d/n^~'^). This completes the proof of the claim. 

Eq. (I24p then follows from the definition of the risk in ()20p and from the Pythagorean 
theorem in (|84p . This implies the assertion of Theorem [TUl ■ 
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